林嶔 (Lin, Chin)
Lesson 17
– Let's first review how the closed-form solution of linear regression is derived:
First, define our prediction equation (y = b0 + b1 * x)
Then, define our objective (minimize the sum of squared residuals)
Take partial derivatives of the sum-of-squares equation to locate the minimum (worked out just below)
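For reference, carrying out that differentiation and setting both partial derivatives to zero yields the familiar closed-form least-squares solution:

$$\hat{b}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$$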
– Let's set aside linear regression and least squares for a moment (least squares was developed in the 18th century) and return to the founding axioms of statistics (statistics is a late-19th-century discipline): how does statistics describe a population and a sample?
Let's start with the simplest possible example. Suppose the prevalence of a disease in the population is p, and we draw one sample (n = 80) from that population, obtaining 49 people with the disease and 31 without...
– First consider a simpler question: suppose the prevalence of this disease is 1/3 in Taipei, 1/2 in Taichung, and 2/3 in Kaohsiung. In which of these cities was our sample most likely drawn?
– Having thought this far, try computing the probability of drawing 49 diseased and 31 healthy people in Taipei (use the binomial distribution, but mind its basic assumptions):
Taipei.p = choose(80, 49) * (1/3)^49 * (1 - 1/3)^31
print(Taipei.p)
## [1] 2.078886e-07
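For reference, the quantity computed above is the binomial probability of the observed sample:

$$P(X = 49) = \binom{80}{49}\, p^{49} (1 - p)^{31}, \qquad p = \tfrac{1}{3} \text{ for Taipei}$$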
– Next, compute the sample probabilities for Taichung and Kaohsiung:
Taichung.p = choose(80, 49) * (1/2)^49 * (1 - 1/2)^31
print(Taichung.p)
## [1] 0.0118359
Kaohsiung.p = choose(80, 49) * (2/3)^49 * (1 - 2/3)^31
print(Kaohsiung.p)
## [1] 0.05449674
sample.p = function (p) {
  choose(80, 49) * (p)^49 * (1 - p)^31  # sample probability as a function of p
}
sample.p(0.1)
## [1] 5.459073e-29
sample.p(0.3)
## [1] 5.402339e-09
sample.p(0.5)
## [1] 0.0118359
sample.p(0.7)
## [1] 0.02270722
sample.p(0.9)
## [1] 8.193775e-12
– With this concept in place, how do we actually solve for p?
seq.p = seq(0, 1, by = 0.001)
result = numeric(length(seq.p))
for (i in 1:length(seq.p)) {
  result[i] = sample.p(seq.p[i])
}
which.max(result)
## [1] 613
seq.p[which.max(result)]
## [1] 0.612
plot(seq.p, result, type = "l", xlab = "population parameter", ylab = "sample probability")
We find that extrema occur at p = 0, p = 1, and p = 49/80; the first two are minima, while p = 49/80 is the maximum, so the answer is 49/80.
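The maximum is easy to verify by hand: taking the log of the sample probability and differentiating,

$$\frac{d}{dp}\left[\ln\binom{80}{49} + 49 \ln p + 31 \ln (1 - p)\right] = \frac{49}{p} - \frac{31}{1 - p} = 0 \quad\Rightarrow\quad p = \frac{49}{80}$$

while at the boundaries the probability itself is zero (p = 0 or p = 1 would make the observed sample impossible).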
We know that not everyone's calculus is up to this, so we can ask R to solve this kind of problem for us.
– We need the function mle from the package stats4 (it can only find where a minimum occurs, not a maximum, so our sample.p function has to be rewritten accordingly):
– Note that mle does not solve the problem by mathematical differentiation; how it actually works will be covered in detail next lesson!
library(stats4)
sample.p = function (p) {
  -choose(80, 49) * (p)^49 * (1 - p)^31  # negated, so minimizing it maximizes the probability
}
fit = mle(sample.p, start = list(p = 0.5), method = "SANN")
fit
##
## Call:
## mle(minuslogl = sample.p, start = list(p = 0.5), method = "SANN")
##
## Coefficients:
## p
## 0.6124266
49/80
## [1] 0.6125
– Exercise: treat the sample below as independent draws from a normal distribution, and use the same maximum-likelihood idea to estimate its mean and standard deviation:
x = c(1, 7, 5, 6, 8, 3, 2, 9, 4, 5, 3)
– If you are unsure how to obtain the density of a normal distribution, try the function dnorm:
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
dnorm(0, mean = 0, sd = 2)
## [1] 0.1994711
dnorm(0, mean = 1, sd = 2)
## [1] 0.1760327
dnorm(1, mean = 0, sd = 2)
## [1] 0.1760327
– Compare with the direct calculations:
mean(x)
## [1] 4.818182
sd(x)
## [1] 2.522625
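A minimal sketch of this exercise, in the same style as sample.p (the name normal.nll and the parameter names mu and sigma are illustrative choices):

library(stats4)  # already loaded above
normal.nll = function(mu, sigma) {
  # negative log-likelihood of x under a Normal(mu, sigma) model
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
fit.norm = mle(normal.nll, start = list(mu = 5, sigma = 2), method = "SANN")
fit.norm@coef

The estimate of mu should agree approximately with mean(x); the maximum-likelihood estimate of sigma divides by n rather than n - 1, so it comes out slightly smaller than sd(x).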
First, define our prediction equation (y = b0 + b1 * x)
Then, define our objective (maximize the sample probability)
– One caveat: the probability of drawing any single observation is very small, and the running product over all observations becomes smaller still, possibly too small for the computer to represent. Taking logarithms solves this problem nicely!
– Also, what was a product of probabilities becomes a sum after the log transform!
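In symbols, for a sample of independent observations:

$$\log L(\theta) = \log \prod_{i=1}^{n} f(y_i \mid \theta) = \sum_{i=1}^{n} \log f(y_i \mid \theta)$$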
x = c(1, 2, 3, 4, 5)
y = c(6, 7, 9, 8, 10)
linear.p = function(b0, b1) {
  y.hat = b0 + b1 * x                # predictions from the linear model
  res = y - y.hat                    # residuals
  mean.res = 0                       # residuals are assumed centered at 0
  sd.res = sd(res)                   # plug-in estimate of the residual SD
  log_p.res = dnorm(res, mean = mean.res, sd = sd.res, log = TRUE)
  return(-sum(log_p.res))            # negative log-likelihood, for mle to minimize
}
fit1 = mle(linear.p, start = list(b0 = 0, b1 = 0), method = "SANN")
fit1
##
## Call:
## mle(minuslogl = linear.p, start = list(b0 = 0, b1 = 0), method = "SANN")
##
## Coefficients:
## b0 b1
## 5.3000118 0.8987946
fit2 = lm(y~x)
fit2
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 5.3 0.9
vcov(fit2)
## (Intercept) x
## (Intercept) 0.6966667 -0.19000000
## x -0.1900000 0.06333333
– The object fit1 returned by mle is an S4 object, so its components are accessed through slots rather than with $:
slotNames(fit1)
## [1] "call" "coef" "fullcoef" "vcov" "min" "details"
## [7] "minuslogl" "nobs" "method"
slot(fit1, "vcov")
## b0 b1
## b0 0.4370205 -0.11400892
## b1 -0.1140089 0.03800375
fit1@vcov
## b0 b1
## b0 0.4370205 -0.11400892
## b1 -0.1140089 0.03800375
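A common use of this matrix: the standard errors of the estimates are the square roots of its diagonal.

sqrt(diag(fit1@vcov))  # standard errors of b0 and b1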
– Let's check the agreement on a larger simulated dataset:
x = rnorm(1000)
y = 3 + 2 * x + rnorm(1000)
fit1 = mle(linear.p, start = list(b0 = 0, b1 = 0), method = "SANN")
fit2 = lm(y~x)
fit1@coef
## b0 b1
## 3.034672 1.981693
fit2$coefficients
## (Intercept) x
## 3.036488 1.983512
fit1@vcov
## b0 b1
## b0 9.528576e-04 -7.396710e-06
## b1 -7.396710e-06 1.004551e-03
vcov(fit2)
## (Intercept) x
## (Intercept) 9.538094e-04 -7.418096e-06
## x -7.418096e-06 1.006548e-03
– The linear predictor can be transformed into a probability like this:
x = 0:10
y = c(0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1)
b0 = -3
b1 = 0.5
p = exp(b0 + b1 * x)/(1 + exp(b0 + b1 * x))
p
## [1] 0.04742587 0.07585818 0.11920292 0.18242552 0.26894142 0.37754067
## [7] 0.50000000 0.62245933 0.73105858 0.81757448 0.88079708
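Following the same recipe as linear.p, here is a sketch of the corresponding maximum-likelihood fit (the name logistic.p is an illustrative choice; the normal density is replaced by the Bernoulli log-likelihood):

logistic.p = function(b0, b1) {
  p = exp(b0 + b1 * x)/(1 + exp(b0 + b1 * x))  # logistic transform of the linear predictor
  log_p = y * log(p) + (1 - y) * log(1 - p)    # Bernoulli log-likelihood of each observation
  return(-sum(log_p))
}
fit_logit = mle(logistic.p, start = list(b0 = 0, b1 = 0), method = "SANN")
fit_logit@coef  # should come out close to the glm coefficients below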
– Compare the result with the function glm:
fit3 = glm(y~x, family = "binomial")
fit3
##
## Call: glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## (Intercept) x
## -1.4719 0.3409
##
## Degrees of Freedom: 10 Total (i.e. Null); 9 Residual
## Null Deviance: 15.16
## Residual Deviance: 12.64 AIC: 16.64
First, define our prediction equation
Then, define our objective
– Download the example data here.
dat = read.csv("data/Patient2.csv")
head(dat)
## t Y
## 1 0.00000000 232.9187
## 2 0.08333333 234.6244
## 3 0.16666667 234.1379
## 4 0.25000000 233.6271
## 5 0.33333333 232.8027
## 6 0.41666667 231.9975
– Based on our medical and physical knowledge, the relationship between time and residual radioactivity should be as follows:

$$D_{1m}(t) = D_{1m}(0)\left[(1 - k)\,e^{-\ln 2 \cdot t / T_1} + k\,e^{-\ln 2 \cdot t / T_2}\right]$$

D_1m is the predicted value
D_1m(0) is the initial value
T1 is the biological (in-body) half-life
T2 is the physical half-life (fixed at 175.2)
k is the metabolic proportion of the radioiodine inside vs. outside the body
t is time
predict.func = function (D_1m.0, T1, T2 = 175.2, k, t) {
  D_1m = D_1m.0 * ((1 - k) * exp(-log(2) * t / T1) + k * exp(-log(2) * t / T2))
  return(D_1m)
}
predict.func(D_1m.0 = 250, T1 = 3, k = 0.7, t = 0)
## [1] 250
predict.func(D_1m.0 = 250, T1 = 3, k = 0.7, t = 1)
## [1] 233.8366
predict.func(D_1m.0 = 250, T1 = 3, k = 0.7, t = 2)
## [1] 220.8678
X = dat$t[1:350]
Y = dat$Y[1:350]
prob.sample = function(D_1m.0, T1, k) {
  pred.Y = predict.func(D_1m.0 = D_1m.0, T1 = T1, k = k, t = X)
  res = log(Y) - log(pred.Y)         # residuals on the log scale
  mean.res = 0
  sd.res = sd(res)
  log_p.res = dnorm(res, mean = mean.res, sd = sd.res, log = TRUE)
  return(-sum(log_p.res, na.rm = TRUE))  # negative log-likelihood
}
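Note what the residual definition above implies: the noise is assumed to be multiplicative (lognormal) around the decay curve,

$$\log Y_i = \log D_{1m}(t_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$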
prob.sample(D_1m.0 = 250, T1 = 5, k = 0.1)
## [1] 575.2135
prob.sample(D_1m.0 = 200, T1 = 3, k = 0.2)
## [1] 1147.613
fit = mle(prob.sample, start = list(D_1m.0 = 250, T1 = 10, k = 0.1), method = "SANN")
fit
##
## Call:
## mle(minuslogl = prob.sample, start = list(D_1m.0 = 250, T1 = 10,
## k = 0.1), method = "SANN")
##
## Coefficients:
## D_1m.0 T1 k
## 243.9800863 9.7114614 0.1224558
NEW.X = dat$t[351:459]
NEW.Y = dat$Y[351:459]
pred.Y = predict.func(D_1m.0 = fit@coef[1],
T1 = fit@coef[2],
T2 = 175.2,
k = fit@coef[3],
t = NEW.X)
sum(log(NEW.Y) - log(pred.Y))^2  # (sum of log-scale prediction errors)^2 on the held-out data
## [1] 124.9729
pred.Y_1 = predict.func(D_1m.0 = fit@coef[1],
T1 = fit@coef[2],
T2 = 175.2,
k = fit@coef[3],
t = seq(0, 40, by = 0.1))
plot(dat$t, dat$Y, pch = 19, xlab = "time", ylab = "Y")
lines(seq(0, 40, by = 0.1), pred.Y_1, col = "red")
– Now try a far simpler competitor: an ordinary linear regression on the log-transformed outcome.
log.Y = log(Y)
fit_log.lm = lm(log.Y~X)
pred.Y_log = fit_log.lm$coefficients[1] + NEW.X * fit_log.lm$coefficients[2]
sum(log(NEW.Y) - pred.Y_log)^2  # (sum of log-scale prediction errors)^2 on the held-out data
## [1] 1.791982
– Let's draw a picture to compare:
pred.Y_2 = exp(fit_log.lm$coefficients[1] + seq(0, 40, by = 0.1) * fit_log.lm$coefficients[2])
plot(dat$t, dat$Y, pch = 19, xlab = "time", ylab = "Y")
lines(seq(0, 40, by = 0.1), pred.Y_1, col = "red")
lines(seq(0, 40, by = 0.1), pred.Y_2, col = "blue")
A more complex relationship is not necessarily better; what actually matters is how well it fits reality. When the underlying relationship cannot be fully understood, governing complexity with simplicity is sometimes a good choice.
In the final example with mle, did everyone's results come out different? Try changing the starting values; what do you notice?
In fact, the computer cannot solve the derivatives for us: what we obtain is really a 'local extremum' rather than the 'global extremum'. This caused no trouble in the linear and logistic regressions above because their objective functions are simple in shape, but once the relationship becomes complex, local extrema arise easily.
In the next lesson we will look at how to solve for the parameters when the calculus cannot be worked out; the corresponding price is that we can usually find only a 'local extremum' rather than the 'global extremum'. Think it over at home: is this an important problem?